About the project

Course description

This is a course on Open Data Science.

GitHub
My GitHub repository can be found here


Regression and model validation

Reading the data into R

Load GGally (which also loads ggplot2), set the working directory to the IODS folder and read in the local tab-separated copy of the file

library(GGally)
## Loading required package: ggplot2
setwd("~/Work/IODS-project")
analysis2014 <- read.table("data/analysis2014.txt", sep="\t", header=TRUE)

The structure and the dimensions of the data

str(analysis2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
dim(analysis2014)
## [1] 166   7

The data has 166 rows and 7 columns. The columns show the age, attitude, exam points and gender of the 166 students.
In addition, the data frame has three combined scores for the deep (deep), surface (surf) and strategic (stra) learning approaches.

Data overview

ggpairs(analysis2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

summary(analysis2014)
##       Age           Attitude         Points      gender       deep      
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   F:110   Min.   :1.583  
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   M: 56   1st Qu.:3.333  
##  Median :22.00   Median :32.00   Median :23.00           Median :3.667  
##  Mean   :25.51   Mean   :31.43   Mean   :22.72           Mean   :3.680  
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75           3rd Qu.:4.083  
##  Max.   :55.00   Max.   :50.00   Max.   :33.00           Max.   :4.917  
##       surf            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

Most of the variables have a close to normal distribution, except age, which is right-skewed and closer to a Poisson distribution. There are about twice as many women as men in the data set (110 vs. 56). The strongest correlation can be seen between Attitude and Points.
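The correlation claim can be checked numerically with cor() (a sketch, assuming the analysis2014 data frame loaded above):

```r
# Pairwise correlations of the numeric variables; the Attitude-Points
# pair should show the largest off-diagonal absolute value
round(cor(analysis2014[, c("Age", "Attitude", "Points", "deep", "surf", "stra")]), 2)
```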

Regression model

lm1 <- lm(Points~Age+gender+Attitude, data=analysis2014)
summary(lm1)
## 
## Call:
## lm(formula = Points ~ Age + gender + Attitude, data = analysis2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08
lm2 <- lm(Points~Attitude, data=analysis2014)
summary(lm2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysis2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In the model with age, gender and attitude as explanatory variables, only attitude had a statistically significant effect on the exam points; age and gender did not. In the final model, fitted with attitude alone, attitude shows a significant positive association with points (about 0.35 points per one-unit increase in attitude).

Model diagnostics

par(mfrow=c(2,2))
plot(lm2, which = c(1,2,5))

The diagnostic plots for the second model show no clear pattern in the residuals, suggesting constant variance. A few possible outliers are visible in the plots. The Q-Q plot shows that the residuals are fairly normally distributed, again excluding those few possible outliers, and the residuals-vs-leverage plot likewise flags a few points that could be outliers in this data set.
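If needed, the flagged observations can be listed by residual size. A sketch on made-up data (the same call works on lm2 from above):

```r
# List the rows with the largest absolute residuals from a fitted lm
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)
head(order(abs(resid(fit)), decreasing = TRUE), 3)  # row indices of the top 3
```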


Logistic regression

The libraries needed in the analyses

library(GGally)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in the data

We start by reading in the data that was saved after the data wrangling part.

alc <- read.table("data/alc_data.txt", sep="\t", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data consist of various attributes of students from two Portuguese schools. Some of the records in the two original data sets come from the same students, so the sets have been joined based on the students' background answers; only the students present in both questionnaires are included.

Choose the variables and explore them

The four variables I think could be linked with high or low alcohol consumption are listed below; a quick random forest variable-importance plot helps check the choice.

alc_rf <- randomForest(high_use ~ ., data=alc)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
varImpPlot(alc_rf)

  • sex (males tend to drink more than females)
  • absences (the more you drink, the more likely you are to be absent)
  • G3, the final grade (alcohol consumption could lower the final course grade)
  • goout, going out with friends (what else would you do with friends?)
ggpairs(alc[,c("high_use", "sex", "absences", "G3", "goout")],  aes(col=high_use, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

table(high_use = alc$high_use, sex = alc$sex)
##         sex
## high_use   F   M
##    FALSE 156 112
##    TRUE   42  72
boxplot(absences~high_use, data=alc, ylab="Number of absences", xlab="High use")

boxplot(G3~high_use, data=alc, ylab="Final grade", xlab="High use")

boxplot(goout~high_use, data=alc, ylab="Number of times going out w/ friends", xlab="High use")

There seem to be proportionally more heavy users among males. The heavy users also have more absences, but their grades do not differ much. The strongest association is probably with going out with friends: the more often you go out, the more likely you are to be a heavy user.

Logistic model

m1 <- glm(high_use ~ sex + absences + G3 + goout, data = alc, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ sex + absences + G3 + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9286  -0.8055  -0.5308   0.8264   2.4649  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.57370    0.68720  -5.200 1.99e-07 ***
## sexM         0.95202    0.25496   3.734 0.000188 ***
## absences     0.08215    0.02237   3.672 0.000241 ***
## G3          -0.04453    0.03883  -1.147 0.251571    
## goout        0.70844    0.12065   5.872 4.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 386.44  on 377  degrees of freedom
## AIC: 396.44
## 
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept)        sexM    absences          G3       goout 
## -3.57369722  0.95202361  0.08214567 -0.04452530  0.70844342
OR <- coef(m1) %>% exp()
CI <- confint(m1) %>% exp()
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR       2.5 %    97.5 %
## (Intercept) 0.02805195 0.006927766 0.1031581
## sexM        2.59094742 1.581095123 4.3051892
## absences    1.08561394 1.040315244 1.1370409
## G3          0.95645140 0.886134749 1.0323272
## goout       2.03082765 1.612303135 2.5901641

From the model summary and the odds ratios with their confidence intervals we can see that sex, the number of absences and the frequency of going out with friends are all associated with high alcohol use, while the final grade (G3) is not (its confidence interval includes 1).
Males are more likely to be heavy users, each additional absence slightly raises the odds of high use, and the more often a student goes out with friends, the more likely they are to be a heavy user.
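As a reminder of where the odds ratios come from, they are just the exponentiated logit coefficients; for example the goout estimate printed in the summary above:

```r
# An odds ratio is exp(coefficient); for goout (estimate 0.70844342):
exp(0.70844342)  # ~2.03: each one-point increase roughly doubles the odds
```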

Refined model and the predictive power

m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)

A cross tabulation of the predictions against the true values shows the predictive power of our model, followed by a graphical visualization of it.

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
ggplot(alc, aes(y = high_use, x = probability, col=prediction)) + geom_point()

From the cross tabulation we can see that the overall accuracy is close to 0.79, meaning that ~21 % of the students are classified incorrectly. Next we can compare the model's predictive power to a simple guessing strategy.
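The accuracy can be read straight off the proportion table above: the off-diagonal cells are the misclassified shares.

```r
# Overall accuracy from the proportion table above: one minus the
# sum of the off-diagonal (misclassified) cells
wrong <- 0.03926702 + 0.17015707
1 - wrong  # ~0.79
```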

# Mean prediction error: the share of observations where the predicted
# probability falls on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(alc$high_use, prob=0)
## [1] 0.2984293
loss_func(alc$high_use, alc$probability)
## [1] 0.2094241

We can see that the model performs better than simple guessing (~0.21 error vs. ~0.30 error for always guessing FALSE).
We can get a less optimistic estimate of the error by performing 10-fold cross-validation.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
## [1] 0.2146597

We can see that this model performs better than the one introduced in DataCamp (~0.21 vs. 0.26 cross-validated error).
Hooray for that!


Clustering and classification

Load the data

Load the data from the MASS package and explore the structure and dimensions

library(dplyr)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset is a data frame with 506 rows and 14 columns. The columns describe housing values and related characteristics of the suburbs of Boston.

Graphical overview of data

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

There are both continuous and binary variables in the data. Some of the variables are highly correlated, either negatively, such as the lower status of the population (lstat) and the median value of owner-occupied homes in $1000s (medv), or positively, such as the full-value property-tax rate per $10,000 (tax) and the proportion of non-retail business acres per town (indus).
However, the variables are on very different scales and not all normally distributed, while LDA assumes normally distributed classes with equal covariance, so we scale the variables first.

Standardization and categorization of the crime rate

We scale the values to have a mean of 0 and standard deviation of 1.
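Under the hood this is just the z-score transformation; a quick sanity check on a toy vector:

```r
# scale() centers each column and divides by its sample sd (z-scores)
x <- c(2, 4, 6, 8)
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```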

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
apply(boston_scaled,2, sd, na.rm=TRUE)
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##       1       1       1       1       1       1       1       1       1 
##     tax ptratio   black   lstat    medv 
##       1       1       1       1       1
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Then we use the quantile function to break the scaled crime rate into 4 bins and create a categorical variable with 4 levels. We then remove the original crime rate variable and replace it with the new categorical one.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

After that we randomly divide the data into train and test sets, with 80 % of the observations going to the train set.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear discriminant analysis

lda.fit <- lda(crime~., data=train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen=2, col=classes)

Predictions from the LDA

First we save the correct classes of the test set and remove the crime variable from the test data. Then we predict the crime category for the test set using the model fitted on the train set and cross-tabulate the predictions with the correct classes.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata=test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        3    0
##   med_low    3      18        4    0
##   med_high   0       7       10    2
##   high       0       0        0   29

The model predicted the crime categories quite well; in this run the confusion is mostly between neighbouring categories, and the high category is predicted perfectly. With fewer classes, e.g. 3, the result would probably have been even better.
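The overall accuracy of this split can be computed from the confusion matrix above (a sketch using the numbers printed in this run; the train/test split is random, so they vary between runs):

```r
# Rows = correct class, columns = predicted class, values from the table above
tab <- matrix(c(18, 3, 0, 0,   8, 18, 7, 0,   3, 4, 10, 0,   0, 0, 2, 29),
              nrow = 4)
sum(diag(tab)) / sum(tab)  # 75 / 102, i.e. ~0.74 accuracy
```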

Euclidean distance and k-means clustering

data("Boston")
boston_scaled <- scale(Boston)
boston_euk_dist <- dist(boston_scaled)

Then we can run k-means clustering with 3 clusters and visualise the result with a few relevant variables.

km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[,6:10], col = km$cluster)

It seems that 3 clusters might not be the best choice, so we can investigate the optimal number of clusters by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes.

set.seed(123)

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

The total WCSS drops most sharply when moving to 2 clusters, so 2 seems to be the optimal number. Let's visualise that solution with the same variables.

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,6:10], col = km$cluster)

Looks better.

Bonus

Reload the dataset and standardize it.

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

K-means clustering with 5 clusters and LDA on these 5 clusters.

km <- kmeans(boston_scaled, centers=5)
boston_scaled <- mutate(boston_scaled, cluster=km$cluster)
lda_boston <- lda(cluster ~., data=boston_scaled)

Plot the results.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda_boston, dimen=2, col=boston_scaled$cluster)
lda.arrows(lda_boston, myscale = 2, tex=1)

The variables with the biggest impact are tax and rad, which separate clusters 1 and 2 from the others. Of the remaining variables, zn separates clusters 3, 4 and 5 from each other.


Dimensionality reduction techniques

Load the data and the libraries needed

library(dplyr)
library(GGally)
human <- read.table("data/human.txt", row.names = 1, header=TRUE, sep="\t")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Pop2EduF  : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ LabFperM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ ExpEdu    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ LifeExp   : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI       : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MatMort   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ AdolBirth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ PercInParl: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The data frame looks as it should.

Graphical overview

ggpairs(human)

summary(human)
##     Pop2EduF         LabFperM          ExpEdu         LifeExp     
##  Min.   :  0.90   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.: 27.15   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median : 56.60   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   : 55.37   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.: 85.15   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :100.00   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            MatMort         AdolBirth        PercInParl   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Most of the variables are close to normally distributed, except maternal mortality, GNI and the adolescent birth rate, which are right-skewed and more closely resemble a Poisson distribution. There are clear correlations between some of the variables, such as expected education and life expectancy, and maternal mortality and the adolescent birth rate. The variance of GNI is very large compared to the other variables.

PCA on unscaled data

pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Because PCA is driven by variance and GNI has a much larger variance than the other variables, the first component of the unscaled PCA is dominated by GNI. We need to scale the data to get meaningful results.
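A toy illustration of the problem, on made-up data: when one variable is on a much larger scale, unscaled PCA loads almost everything onto it.

```r
# Two uncorrelated columns; b has a thousand-fold larger scale than a,
# so PC1 of the unscaled PCA is essentially just b
set.seed(1)
d <- data.frame(a = rnorm(50), b = rnorm(50, sd = 1000))
p <- prcomp(d)
round(summary(p)$importance[2, 1], 3)  # proportion of variance on PC1, ~1
```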

Standardized data and PCA

human_scaled <- scale(human)
pca_human <- prcomp(human_scaled)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 56.1 16.5  9.9  6.8  3.8  2.9  2.5  1.4
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The scaling definitely made a difference: the differences between the countries are no longer driven by GNI alone.
Most of the variation (PC1, 56.1 %) is explained by the educational status of the population (especially of women), life expectancy and GNI, which correlate negatively with maternal mortality and the adolescent birth rate. The second axis (PC2, 16.5 %) separates countries by women's labour force participation and their percentage in parliament.
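The two axes together cover a large share of the total variation; this can be read from the pca_pr percentages printed above:

```r
# Cumulative variance explained, using the pca_pr values from above
pca_pr <- c(PC1 = 56.1, PC2 = 16.5, PC3 = 9.9, PC4 = 6.8,
            PC5 = 3.8, PC6 = 2.9, PC7 = 2.5, PC8 = 1.4)
cumsum(pca_pr)[1:2]  # the first two components cover ~72.6 %
```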

Tea dataset

library(FactoMineR)
library(tidyr)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
gather(tea) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

We’ll select a few of the variables for the Multiple Correspondence Analysis (MCA).

keep_columns <- c("home", "friends", "Tea", "how", "sex")
tea_time <- dplyr::select(tea, one_of(keep_columns))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.269   0.230   0.216   0.199   0.178   0.173
## % of var.             19.191  16.442  15.410  14.216  12.683  12.344
## Cumulative % of var.  19.191  35.633  51.043  65.258  77.941  90.286
##                        Dim.7
## Variance               0.136
## % of var.              9.714
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.928  1.067  0.598 |  0.114  0.019  0.009 | -0.181
## 2                  |  0.494  0.302  0.190 |  0.041  0.002  0.001 | -0.381
## 3                  | -0.503  0.313  0.492 |  0.229  0.076  0.102 | -0.126
## 4                  |  0.414  0.213  0.183 |  0.608  0.535  0.394 | -0.218
## 5                  |  0.414  0.213  0.183 |  0.608  0.535  0.394 | -0.218
## 6                  |  0.414  0.213  0.183 |  0.608  0.535  0.394 | -0.218
## 7                  | -0.069  0.006  0.007 |  0.303  0.133  0.137 |  0.074
## 8                  |  0.494  0.302  0.190 |  0.041  0.002  0.001 | -0.381
## 9                  |  0.170  0.036  0.024 |  0.020  0.001  0.000 | -0.016
## 10                 |  0.683  0.579  0.271 | -0.474  0.325  0.130 |  0.021
##                       ctr   cos2  
## 1                   0.051  0.023 |
## 2                   0.224  0.113 |
## 3                   0.025  0.031 |
## 4                   0.073  0.050 |
## 5                   0.073  0.050 |
## 6                   0.073  0.050 |
## 7                   0.008  0.008 |
## 8                   0.224  0.113 |
## 9                   0.000  0.000 |
## 10                  0.001  0.000 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## home               |   0.001   0.000   0.000   0.050 |  -0.094   0.747
## Not.home           |  -0.016   0.001   0.000  -0.050 |   3.043  24.141
## friends            |  -0.434   9.162   0.355 -10.303 |  -0.254   3.658
## Not.friends        |   0.818  17.266   0.355  10.303 |   0.478   6.894
## black              |   0.859  13.545   0.242   8.498 |  -0.816  14.261
## Earl Grey          |  -0.471  10.640   0.401 -10.946 |   0.368   7.583
## green              |   0.831   5.650   0.085   5.050 |  -0.325   1.009
## tea bag            |   0.060   0.149   0.005   1.177 |   0.601  17.765
## tea bag+unpackaged |  -0.573   7.669   0.150  -6.698 |  -0.809  17.836
## unpackaged         |   1.216  13.213   0.202   7.766 |  -0.723   5.452
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## home                 0.286  -9.255 |  -0.110   1.087   0.391 -10.810 |
## Not.home             0.286   9.255 |   3.555  35.143   0.391  10.810 |
## friends              0.121  -6.026 |   0.234   3.330   0.104   5.566 |
## Not.friends          0.121   6.026 |  -0.442   6.276   0.104  -5.566 |
## black                0.218  -8.071 |   0.219   1.101   0.016   2.172 |
## Earl Grey            0.245   8.554 |   0.135   1.086   0.033   3.134 |
## green                0.013  -1.975 |  -1.281  16.742   0.203  -7.789 |
## tea bag              0.472  11.878 |  -0.364   6.954   0.173  -7.194 |
## tea bag+unpackaged   0.299  -9.455 |   0.106   0.324   0.005   1.233 |
## unpackaged           0.071  -4.617 |   1.442  23.146   0.284   9.211 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## home               | 0.000 0.286 0.391 |
## friends            | 0.355 0.121 0.104 |
## Tea                | 0.401 0.263 0.204 |
## how                | 0.283 0.472 0.328 |
## sex                | 0.305 0.008 0.052 |
plot(mca, invisible=c("none"), habillage = "quali")

Based on the MCA, men appear to favour green and black tea, drunk unpackaged, at home and alone.
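To back up this reading of the biplot more formally, FactoMineR offers `dimdesc()`, which lists the variables and categories most strongly associated with each dimension. A minimal sketch, assuming `mca` is the MCA object fitted above:

```r
# Describe the MCA dimensions (assumes `mca` from MCA() above):
# which variables and categories load on each axis, with p-values
library(FactoMineR)
dd <- dimdesc(mca)
dd$`Dim 1`$quali     # variables linked to dimension 1 (eta2, p-values)
dd$`Dim 1`$category  # significant categories on dimension 1
```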


Analysis of longitudinal data

RATS data

Read in the data and convert the ID and Group columns to factors.

library(tidyverse)
RATSL <- read.table("data/RATSL.txt", header = TRUE)
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD     <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...

The data seems to be in order.
On to the analyses.
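Before plotting, a quick sanity check of the long format can help: each rat should contribute one row per measurement time. A small sketch, assuming `RATSL` has been read in as above:

```r
# Sanity check of the long format (assumes RATSL as read above):
# every rat should have the same number of measurement times
library(dplyr)
RATSL %>%
  count(ID) %>%
  summarise(rats = n(), times_per_rat = unique(n))
# Group sizes at baseline
RATSL %>% filter(Time == 1) %>% count(Group)
```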

Non-standardized data

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Standardization

Since a higher weight at the beginning seems to also mean a higher weight at the end, the data should be standardized.
\[\ standardized(x) = \frac{x-mean(x)}{sd(x)}\]

RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
  ungroup()

ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$stdweight), max(RATSL$stdweight)), name = "standardized weight")

Mean response for treatments

The standard error of the mean can be calculated with the following equation: \[\ se = \frac{sd(x)}{\sqrt{n}} \]

n <- RATSL$Time %>% unique() %>% length()
RATS_trmt <- RATSL %>% group_by(Group, Time) %>% summarize(mean = mean(Weight), se = sd(Weight)/sqrt(n))
glimpse(RATS_trmt)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(RATS_trmt, aes(x=Time, y=mean, linetype = Group, shape = Group)) + 
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=2) +
  scale_shape_manual(values = c(1,2,5)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.5) +
  theme(legend.position = c(0.8,0.5), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") 

ggplot(RATSL, aes(x= as.factor(Time), y=Weight, fill=Group)) +
  geom_boxplot() +
  theme(legend.position = c(0.8,0.4), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
  scale_y_continuous(name = "Weight") + scale_x_discrete(name = "Time") + 
  scale_fill_manual(values=c("black", "grey50", "white"))

Outlier in the data

RATSL8 <- RATSL %>% 
  group_by(Group, ID) %>%
  summarize(mean = mean(Weight)) %>%
  ungroup()

ggplot(RATSL8, aes(x=Group, y=mean)) + 
  geom_boxplot() + 
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") + 
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
  scale_y_continuous(name = "mean(Weight) per group")

There’s one outlier in group 2. Let’s remove that single data point and plot the data again.

Without the outlier

RATSL8S1 <- filter(RATSL8, (Group=="2" & mean<500) | Group!="2")
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight) per group")

Test group differences

It seems obvious that there are differences between the groups. Let’s still test that using analysis of variance.

RATS_lm <- aov(mean ~ Group, data = RATSL8S1)
summary(RATS_lm)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 206390  103195   483.6 3.39e-12 ***
## Residuals   12   2561     213                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected, there was a difference between the groups. We can do a Tukey post hoc test to see the pairwise differences.

TukeyHSD(RATS_lm)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mean ~ Group, data = RATSL8S1)
## 
## $Group
##          diff       lwr      upr    p adj
## 2-1 185.73864 159.35462 212.1227 0.00e+00
## 3-1 262.07955 238.21430 285.9448 0.00e+00
## 3-2  76.34091  46.57572 106.1061 4.94e-05

There was a clear difference between all groups.
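The ANOVA assumes roughly normal residuals with similar variance in each group, so a quick diagnostic check is worthwhile. A sketch, assuming `RATS_lm` and `RATSL8S1` from above:

```r
# Diagnostic checks for the ANOVA (assumes RATS_lm and RATSL8S1 above):
# residuals should be roughly normal with similar spread in each group
shapiro.test(residuals(RATS_lm))               # normality of residuals
bartlett.test(mean ~ Group, data = RATSL8S1)   # homogeneity of variance
par(mfrow = c(1, 2))
plot(RATS_lm, which = c(1, 2))                 # residuals vs fitted, QQ plot
```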

BPRS data

Read in the data and convert the treatment and subject columns to factors.

BPRSL <- read.table("data/BPRSL.txt", header = TRUE)
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks     <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The data seems to be in order.
On to the analyses.

Data overview

ggplot(BPRSL, aes(x=week, y=bprs, group=subject, shape=treatment)) + 
  geom_point() +
  theme(legend.position = "top", panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE))

ggplot(BPRSL, aes(x = week, y = bprs, fill=subject)) +
  geom_line(aes(linetype = treatment)) +
  theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) + 
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2), expand=c(0,0)) + scale_y_continuous(name = "bprs") + theme(legend.position = "top")

Linear mixed effects models for repeated measurements

First we fit a random intercept model

library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Then we fit a random intercept and random slope model and test whether it’s better than the random intercept only model.

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8202  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The random slope model fit better (p = 0.026). So last we can try an interaction between week and treatment.

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0767  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0016  8.0624        
##           week         0.9688  0.9843   -0.51
##  Residual             96.4699  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

That was not significantly better, so we’ll stick with the model without the interaction and plot that.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:randomForest':
## 
##     combine
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, fill = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Observed bprs") +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE), 
        legend.position = "top")

Fitted <- fitted(BPRS_ref1)
BPRSL <- mutate(BPRSL, fitted=Fitted)

p2 <- ggplot(BPRSL, aes(x = week, y = fitted, fill = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(panel.grid = element_blank(), panel.background = element_blank(),
        panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE), 
        legend.position = "top")
grid.arrange(p1, p2, ncol=2)

That’s it.
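The fitted lines differ per subject because each subject gets its own intercept and slope in `BPRS_ref1`; those subject-specific terms can be inspected directly. A short sketch, assuming `BPRS_ref1` fitted above with lme4:

```r
# Subject-specific intercepts and slopes from the random slope model
# (assumes BPRS_ref1 fitted above with lme4)
library(lme4)
ranef(BPRS_ref1)$subject  # each subject's deviation from the fixed effects
coef(BPRS_ref1)$subject   # fixed effects + deviations = per-subject lines
```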